String Matching in the DNA Alphabet
Abstract
Searching for occurrences of string patterns is a common problem in many applications. Various good solutions have been presented for string matching; in practice, the most efficient are based on the Boyer–Moore algorithm [1]. A typical question in molecular biology is whether a given sequence has appeared elsewhere. In the following, we concentrate on searching for exact occurrences of long patterns in the DNA alphabet, which typically contains four characters: a, c, g, and t.

Biologists are often interested in finding similar sequences rather than identical ones. Nevertheless, exact searching can serve as a fast subroutine of approximate searching: at low error levels, any exact-search algorithm can be used as a fast filter. Assume that we allow e errors. If we divide the pattern into e+1 disjoint blocks, every approximate occurrence contains an exact occurrence of at least one block, because e errors can affect at most e of the blocks. Thus an occurrence of any block defines a potential approximate occurrence of the pattern, which can then be checked with a slower dynamic programming method.

Hume and Sunday [3] review several techniques for improving the practical efficiency of the Boyer–Moore algorithm, including different shift heuristics, tight loops, loop unrolling, and some other approaches. Their study mainly deals with searching for words in English text, but they also tested their algorithms on DNA strings. Later, Kim and Shawe-Taylor presented more efficient solutions for DNA strings, based on an implementation of an algorithm introduced by Baeza-Yates [5]. We introduce a new version of the Baeza-Yates algorithm [4, 5], which is a modification of the Boyer–Moore–Horspool algorithm [6] for small alphabets. In the Baeza-Yates algorithm the ...
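The block-filtering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses a plain Boyer–Moore–Horspool search rather than the Baeza-Yates variant, and it assumes a Sellers-style dynamic program for the verification step, which the abstract describes only as "a slower dynamic programming method". All function names are hypothetical, and the pattern is assumed to have at least e+1 characters so that no block is empty.

```python
def horspool_search(text, pattern):
    """Return start indices of all exact occurrences (Boyer-Moore-Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift per character: distance from its last occurrence
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


def split_blocks(pattern, e):
    """Split the pattern into e+1 disjoint blocks; return (offset, block) pairs."""
    k = e + 1
    base, extra = divmod(len(pattern), k)
    blocks, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        blocks.append((start, pattern[start:start + size]))
        start += size
    return blocks


def best_edit_distance(pattern, window):
    """Smallest edit distance between the pattern and any substring of window."""
    prev = [0] * (len(window) + 1)  # an occurrence may start anywhere
    for i, pc in enumerate(pattern, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                 # skip a pattern character
                           cur[j - 1] + 1,              # skip a text character
                           prev[j - 1] + (pc != wc)))   # match or substitute
        prev = cur
    return min(prev)


def approx_search(text, pattern, e):
    """Return (lo, hi) text windows containing an occurrence with at most e errors."""
    m = len(pattern)
    regions = set()
    for offset, block in split_blocks(pattern, e):
        for hit in horspool_search(text, block):
            # Widen the window by e on each side to allow insertions and deletions.
            lo = max(0, hit - offset - e)
            hi = min(len(text), hit - offset + m + e)
            regions.add((lo, hi))
    return [(lo, hi) for lo, hi in sorted(regions)
            if best_edit_distance(pattern, text[lo:hi]) <= e]
```

For example, `approx_search("ggg" + "acgtacgaacgt" + "ggg", "acgtacgtacgt", 1)` finds one candidate window through the first block `acgtac` and confirms it with the dynamic program. The filter runs the expensive verification only where some block matches exactly, which is why it pays off mainly at low error levels.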
Related resources
Selecting the shortest superstring in DNA using the particle swarm algorithm
A DNA string can be regarded as a very long string over a four-letter alphabet. Many scientists attempt to decode this string. Since the string is very long, shorter sections of it that overlap one another are decoded. There is no information about the correct position of these sections on the main DNA string. It seems that the shortest string (substring of the main DNA string) i...
On-line string matching algorithms: survey and experimental results
In this paper we present a short survey and experimental results for well-known sequential string matching algorithms. We consider algorithms based on different approaches including classical, suffix automata, bit-parallelism and hashing. We put special emphasis on recently presented algorithms such as the Shift-Or and BNDM algorithms. We compare these algorithms in terms of the number of character...
Approximate String Matching with Reduced Alphabet
We present a method to speed up approximate string matching by mapping the factual alphabet to a smaller alphabet. We apply the alphabet reduction scheme to a tuned version of the approximate Boyer–Moore algorithm utilizing the Four-Russians technique. Our experiments show that the alphabet reduction makes the algorithm faster. Especially in the k-mismatch case, the new variation is faster tha...
A Fast Generic Sequence Matching Algorithm
A string matching (and, more generally, sequence matching) algorithm is presented that has a linear worst-case computing time bound, a low worst-case bound on the number of comparisons (2n), and sublinear average-case behavior that is better than that of the fastest versions of the Boyer–Moore algorithm. The algorithm retains its efficiency advantages in a wide variety of sequence matching problems...
Exact Multiple String Matching Problem for DNA Alphabet
Given a text T = t1t2 ... tn and a set of patterns P = {P1, P2, ..., Pr}, the exact multiple string matching problem (EMSMP) finds the ending positions of all substrings in T that are equal to Pi for 1 ≤ i ≤ r. We regard all substrings in T and patterns in P as data points in an edit distance-based metric space. The data points in T are constructed into a vantage point tree (vp-tree) T. Then, ...
An Efficient Composite-Alphabet Transform for String Matching under a Restricted Alphabet Set
String matching is a problem of finding all occurrences of a short pattern on a relatively long reference string. While a number of methods have been presented, most published implementations assume several restrictions due to some practical issues. We focus on the restriction of the alphabet size, which is usually set to be 256 in many string matching libraries. When strings must be handled ov...
Journal: Softw., Pract. Exper.
Volume: 27
Publication date: 1997